Efficient Automatic OCR Word Validation Using Word Partial Format Derivation and Language Model
Internal identifier: 000783 (Main/Exploration); previous: 000782; next: 000784
Authors: Siyuan Chen [United States]; Dharitri Misra [United States]; George R. Thoma [United States]
Source:
- Proceedings of SPIE, the International Society for Optical Engineering [0277-786X]; 2010.
French descriptors
- Pascal (Inist)
- Wicri:
- topic: Recherche documentaire.
English descriptors
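- KwdEn: Document retrieval, Error analysis, Error estimation, Implementation, Joint, Lexicon, Medical application, Optical character recognition, Pattern recognition, n gram model.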
Abstract
In this paper we present an OCR validation module, implemented for the System for Preservation of Electronic Resources (SPER) developed at the U.S. National Library of Medicine. The module detects and corrects suspicious words in the OCR output of scanned textual documents in three steps: deriving partial formats for each suspicious word, retrieving candidate words from lexicons by partial-match search, and comparing the joint probabilities of the N-gram and the OCR edit transformation corresponding to each candidate. The partial format derivation, based on OCR error analysis, efficiently and accurately generates candidate words from lexicons represented by ternary search trees. In our test case comprising a historic medico-legal document collection, this OCR validation module yielded the correct words with 87% accuracy and reduced the overall OCR word errors by around 60%.
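The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the SPER implementation. It assumes that a "partial format" can be modelled as a fixed-length pattern in which `?` stands for one unreliable character, that the lexicon is a plain ternary search tree, and that candidates are ranked by the sum of a bigram log-probability and an OCR edit log-probability. The class and function names (`TernarySearchTree`, `partial_match`, `best_candidate`) and all probability values are hypothetical and chosen only for illustration.

```python
import math


class Node:
    """One character node of a ternary search tree."""
    __slots__ = ("ch", "lo", "eq", "hi", "is_word")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_word = False


class TernarySearchTree:
    """Lexicon with single-character wildcard (partial-match) search."""

    def __init__(self, words=()):
        self.root = None
        for word in words:
            self.insert(word)

    def insert(self, word):
        if word:
            self.root = self._insert(self.root, word, 0)

    def _insert(self, node, word, i):
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1)
        else:
            node.is_word = True
        return node

    def partial_match(self, pattern, wildcard="?"):
        """Return, sorted, every stored word that matches the pattern."""
        found = []
        if pattern:
            self._match(self.root, pattern, 0, "", wildcard, found)
        return sorted(found)

    def _match(self, node, pattern, i, prefix, wildcard, found):
        if node is None:
            return
        ch = pattern[i]
        # A wildcard explores all three branches; a literal follows one.
        if ch == wildcard or ch < node.ch:
            self._match(node.lo, pattern, i, prefix, wildcard, found)
        if ch == wildcard or ch == node.ch:
            if i + 1 == len(pattern):
                if node.is_word:
                    found.append(prefix + node.ch)
            else:
                self._match(node.eq, pattern, i + 1, prefix + node.ch,
                            wildcard, found)
        if ch == wildcard or ch > node.ch:
            self._match(node.hi, pattern, i, prefix, wildcard, found)


def best_candidate(candidates, prev_word, bigram_prob, edit_prob, floor=1e-9):
    """Pick the candidate maximising log P(cand | prev) + log P(ocr | cand).

    Both tables are illustrative assumptions; unseen events get `floor`.
    """
    def score(cand):
        return (math.log(bigram_prob.get((prev_word, cand), floor))
                + math.log(edit_prob.get(cand, floor)))
    return max(candidates, key=score)


# Toy lexicon and made-up probabilities, for illustration only.
lexicon = TernarySearchTree(["medical", "helical", "medial", "radical"])
candidates = lexicon.partial_match("?e?ical")  # hypothetical partial format
bigram_prob = {("forensic", "medical"): 0.02, ("forensic", "helical"): 1e-6}
edit_prob = {"medical": 0.05, "helical": 0.001}  # P(OCR string | candidate)
print(candidates, best_candidate(candidates, "forensic", bigram_prob, edit_prob))
```

Working in log space avoids numeric underflow when small probabilities are multiplied, and restricting wildcards to the characters judged unreliable keeps the tree traversal narrow, which is one plausible reading of why partial-match search over a ternary search tree is efficient.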
Affiliations:
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000163
- to stream PascalFrancis, to step Curation: 000614
- to stream PascalFrancis, to step Checkpoint: 000157
- to stream Main, to step Merge: 000788
- to stream Main, to step Curation: 000783
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Efficient Automatic OCR Word Validation Using Word Partial Format Derivation and Language Model</title>
<author><name sortKey="Siyuan Chen" sort="Siyuan Chen" uniqKey="Siyuan Chen" last="Siyuan Chen">SIYUAN CHEN</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>U.S. National Library of Medicine</s1>
<s2>Bethesda, MD 20894</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Maryland</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Misra, Dharitri" sort="Misra, Dharitri" uniqKey="Misra D" first="Dharitri" last="Misra">Dharitri Misra</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>U.S. National Library of Medicine</s1>
<s2>Bethesda, MD 20894</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Maryland</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Thoma, George R" sort="Thoma, George R" uniqKey="Thoma G" first="George R." last="Thoma">George R. Thoma</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>U.S. National Library of Medicine</s1>
<s2>Bethesda, MD 20894</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Maryland</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">10-0429695</idno>
<date when="2010">2010</date>
<idno type="stanalyst">PASCAL 10-0429695 INIST</idno>
<idno type="RBID">Pascal:10-0429695</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000163</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000614</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000157</idno>
<idno type="wicri:doubleKey">0277-786X:2010:Siyuan Chen:efficient:automatic:ocr</idno>
<idno type="wicri:Area/Main/Merge">000788</idno>
<idno type="wicri:Area/Main/Curation">000783</idno>
<idno type="wicri:Area/Main/Exploration">000783</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Efficient Automatic OCR Word Validation Using Word Partial Format Derivation and Language Model</title>
<author><name sortKey="Siyuan Chen" sort="Siyuan Chen" uniqKey="Siyuan Chen" last="Siyuan Chen">SIYUAN CHEN</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>U.S. National Library of Medicine</s1>
<s2>Bethesda, MD 20894</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Maryland</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Misra, Dharitri" sort="Misra, Dharitri" uniqKey="Misra D" first="Dharitri" last="Misra">Dharitri Misra</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>U.S. National Library of Medicine</s1>
<s2>Bethesda, MD 20894</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Maryland</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Thoma, George R" sort="Thoma, George R" uniqKey="Thoma G" first="George R." last="Thoma">George R. Thoma</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>U.S. National Library of Medicine</s1>
<s2>Bethesda, MD 20894</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Maryland</region>
</placeName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Proceedings of SPIE, the International Society for Optical Engineering</title>
<title level="j" type="abbreviated">Proc. SPIE Int. Soc. Opt. Eng.</title>
<idno type="ISSN">0277-786X</idno>
<imprint><date when="2010">2010</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Proceedings of SPIE, the International Society for Optical Engineering</title>
<title level="j" type="abbreviated">Proc. SPIE Int. Soc. Opt. Eng.</title>
<idno type="ISSN">0277-786X</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Document retrieval</term>
<term>Error analysis</term>
<term>Error estimation</term>
<term>Implementation</term>
<term>Joint</term>
<term>Lexicon</term>
<term>Medical application</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>n gram model</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Calcul erreur</term>
<term>Reconnaissance forme</term>
<term>Recherche documentaire</term>
<term>Reconnaissance optique caractère</term>
<term>Implémentation</term>
<term>Application médicale</term>
<term>Lexique</term>
<term>Articulation</term>
<term>Modèle de n grams</term>
<term>Estimation erreur</term>
<term>0130C</term>
<term>4230S</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Recherche documentaire</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">In this paper we present an OCR validation module, implemented for the System for Preservation of Electronic Resources (SPER) developed at the U.S. National Library of Medicine. The module detects and corrects suspicious words in the OCR output of scanned textual documents through a procedure of deriving partial formats for each suspicious word, retrieving candidate words by partial-match search from lexicons, and comparing the joint probabilities of N-gram and OCR edit transformation corresponding to the candidates. The partial format derivation, based on OCR error analysis, efficiently and accurately generates candidate words from lexicons represented by ternary search trees. In our test case comprising a historic medico-legal document collection, this OCR validation module yielded the correct words with 87% accuracy and reduced the overall OCR word errors by around 60%.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Maryland</li>
</region>
</list>
<tree><country name="États-Unis"><region name="Maryland"><name sortKey="Siyuan Chen" sort="Siyuan Chen" uniqKey="Siyuan Chen" last="Siyuan Chen">SIYUAN CHEN</name>
</region>
<name sortKey="Misra, Dharitri" sort="Misra, Dharitri" uniqKey="Misra D" first="Dharitri" last="Misra">Dharitri Misra</name>
<name sortKey="Thoma, George R" sort="Thoma, George R" uniqKey="Thoma G" first="George R." last="Thoma">George R. Thoma</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000783 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000783 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= Pascal:10-0429695 |texte= Efficient Automatic OCR Word Validation Using Word Partial Format Derivation and Language Model }}
This area was generated with Dilib version V0.6.32.